I’m excited about the course. I expect to learn how to use R. Reijo Sund told me about the course. https://github.com/elmeris/IODS-project # About the project
Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you can use R Markdown syntax.
Code for data wrangling: https://github.com/elmeris/IODS-project/blob/master/data/create_learning2014.R
reading data
learning2014 <- read.csv("~/IODS-project/data/learning2014.csv")
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
dim(learning2014)
## [1] 166 7
There is seven variables and 166 observations.
Graphical overview
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
Highest correlation is between points and attitude. All distributions looks even but age is converted to the right.
Regression model
The explanatory variables are “surface”, “attitude” and “strategy”
Scatter plot
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
a linear model
model <- lm(points ~1, data = learning2014)
Summary of the model
summary(model)
##
## Call:
## lm(formula = points ~ 1, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7169 -3.7169 0.2831 5.0331 10.2831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.7169 0.4575 49.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.895 on 165 degrees of freedom
Regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)
Summary of the model2
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Lets move surf out because the p-vaöue is the highest.
New model
my_model3 <- lm(points ~ attitude + stra, data = learning2014)
summary
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Attitude is significant variable. Strategy is not statistically significant. R-squared:the higher the R-squared, the better the model fits your data. R-squared is always between 0 and 100%. Multiple R-squared: 0.2048. Means the model fits the data quite well.
Diagnostic plot
par(mfrow = c(2,2))
plot(my_model3, which = c(1,2,5))
QQ plot: model is normally distributed
Data wrangling I read the data from URL web page, joined the two data sets using the variables “school”, “sex”, “age”, “address”, “famsize”, “Pstatus”, “Medu”, “Fedu”, “Mjob”, “Fjob”, “reason”, “nursery”,“internet” as (student) identifiers. I Kept only the students present in both data sets. Then, I took the average of the answers related to weekday and weekend alcohol consumption to create a new column ‘alc_use’ to the joined data. Then I used ‘alc_use’ to create a new logical column ‘high_use’ which is TRUE for students for which ‘alc_use’ is greater than 2 (the range is: from 1 - very low to 5 - very high).
Code for data wrangling: https://github.com/elmeris/IODS-project/blob/master/data/create_alc.R
Reading data
alc <- read.csv("~/IODS-project/data/alc.csv")
head(alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 2
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1 1 1 3 5 2 8 8 1.0 FALSE
## 2 1 1 3 3 7 8 8 1.0 FALSE
## 3 2 3 3 8 10 10 11 2.5 TRUE
## 4 1 1 5 1 14 14 14 1.0 FALSE
## 5 1 2 5 2 8 12 12 1.5 FALSE
## 6 1 2 5 8 14 14 14 1.5 FALSE
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Description of the data This data has 385 observations for 35 variables. This combined data is from two different schools and it is about 15-22 year old students success in mathematics and portuguese language. S The survey included the following variables: student’s school, student’s sex, student’s age, student’s home address, family size, parent’s cohabitation status, mother’s education, father’s education, mother’s job, father’s job, reason to choose this school, student’s guardian, home to school travel time, weekly study time, number of past class failures, extra educational support, family educational support, extra paid classes within the course subject (Math or Portuguese), extra-curricular activities, attended nursery school, wants to take higher education, Internet access at home, a romantic relationship, quality of family relationships, free time after school, going out with friends, workday alcohol consumption, weekend alcohol consumption, current health status, number of school absences and grades in the course subject, Math or Portuguese: first period grade, second period grade and final grade.
My analysis from data
I am interested in the relationships between high/low alcohol consumption and these four variables: student’s sex (male or female), number of school absences (numeric: from 0 to 93), final grade (numeric: from 0 to 20, output target), age - student’s age (numeric: from 15 to 22).
My hypotesis are: 1) Men consume more alcohol 2) Students with lower grades consume more alcohol 3) Students with more absences consume more alcohol 4) Older students consume more alcohol
Lets find out!
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob <fct> teacher, other, other, services, other, other, other,…
## $ reason <fct> course, course, other, home, home, reputation, home, …
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
gather(alc) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 13,370
## Variables: 2
## $ key <chr> "school", "school", "school", "school", "school", "school"…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
library(dplyr); library(ggplot2)
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 11.4
## 2 F TRUE 42 11.7
## 3 M FALSE 112 12.2
## 4 M TRUE 72 10.3
As we can see in the table above that it seems that my first hypothesis was maybe right because men consume more alcohol than women. Yet I dont know is this difference statistically significant or not.
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")
As the boxplot above shows the students with lower final grade consume more alcohol. This means my second hypothesis was maybe right. Yet I dont know is this difference statistically significant or not.
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
As the boxplot above shows the students with more absences consume more alcohol. This means my third hypothesis was maybe right. Yet I dont know is this difference statistically significant or not.
g3 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g3 + geom_boxplot() + ggtitle("Student age by alcohol consumption and sex")
As the boxplot above shows the older male students and younger female students consume more alcohol. This means my second hypothesis was maybe right with male students but not with female students. Yet I dont know is this difference statistically significant or not.
m <- glm(high_use ~ age + G3 + absences + sex, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + G3 + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2729 -0.8326 -0.6292 1.0442 2.1190
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.38339 1.84796 -1.831 0.067120 .
## age 0.13985 0.10328 1.354 0.175690
## G3 -0.06631 0.03624 -1.830 0.067277 .
## absences 0.08964 0.02309 3.882 0.000104 ***
## sexM 0.98580 0.24184 4.076 4.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.66 on 377 degrees of freedom
## AIC: 433.66
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) age G3 absences sexM
## -3.38338833 0.13985344 -0.06630516 0.08963638 0.98580237
m <- glm(high_use ~ age + G3 + absences + sex, data = alc, family = "binomial")
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03393229 0.0008683319 1.237502
## age 1.15010522 0.9401979286 1.410759
## G3 0.93584524 0.8710385897 1.004426
## absences 1.09377649 1.0474354900 1.146836
## sexM 2.67996134 1.6781593816 4.338864
The fitted logistic regression model demonstrate that two of my chosen variables: absences and sex are statistically significant in relation to alcohol consumption. These variables have a p-value of <0.001.
The two other variables: age and final grade did not show statistical significance in relation to alcohol consumption.
Let’s take a closer look at statistically significant variables. Absences: Odds ratio shows us that the odds of higher alcohol consumption increases 1.08 times when the number of absences increase one unit. (95% Cl 1.05-1.15)
Sex: Odds ratio shows us that the male students have higher alcohol consumption 2.7 times more likely than female students. (95% Cl 1.68-4.34)
Prediction of the probability of high_use
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 255 13
## TRUE 84 30
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66753927 0.03403141 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.88743455 0.11256545 1.00000000
These tables shows us that in 339 false (low alcohol use) predictions 255 were really false, thats about 75%, and in 43 true (high alcohol use) predictions 30 were really true, thats about 70%.
Visualizing the prediction of the probability of high_use
library(dplyr); library(ggplot2)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
This plot shows us that the prediction of correct classification is quite good, but there is some misclassifications with this model.
I created a new RMarkdown file and saved it as an empty file named ‘chapter4.Rmd’. Then I included the file as a child file in my ‘index.Rmd’ file.
This week I used the Boston Housing Dataset for the exercise
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## [1] 506 14
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Description The Boston Housing Dataset is a derived from information collected by the U.S. Census Service concerning housing in the area of Boston MA. The Boston data frame has 506 rows and 14 columns. The following describes the dataset columns:
| VARIABLE | DESCRIPTION |
|---|---|
"CRIM" |
per capita crime rate by town |
"ZN" |
proportion of residential land zoned for lots over 25,000 sq.ft. |
"INDUS" |
proportion of non-retail business acres per town. |
"CHAS" |
Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
"NOX" |
nitric oxides concentration (parts per 10 million) |
"RM" |
average number of rooms per dwelling |
"AGE" |
proportion of owner-occupied units built prior to 1940 |
"DIS" |
weighted distances to five Boston employment centres |
"RAD" |
index of accessibility to radial highways |
"TAX" |
full-value property-tax rate per $10,000 |
"PTRATIO" |
upil-teacher ratio by town |
"B - 1000(Bk - 0.63)^2" |
where Bk is the proportion of blacks by town |
"LSTAT" |
% lower status of the population |
"MEDV" |
Median value of owner-occupied homes in $1000’s |
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
## corrplot 0.84 loaded
Description The graphical overview: First graph gives us some info about which combination of varibles produces some kind of clusters. For example rad ageseems to produse some clusters.
The graphical display of a correlation matrix shows us the correlation between pairs of variables. For exemple the correlation between variables tax and rad is high.
Summaries of thevariables The summary shows the minimum and maximum values, quarters, and mean for each variable. For example for the variable the mean value is 0.5547.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
## [1] "matrix"
Description After standardization, the mean of all variables is zero. Also the minimum, maximum, and quarters values have changed.
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
## crime
## low med_low med_high high
## 127 126 126 127
Description I used the quantiles as the break points in the new crime categorical variable. This new variable has the following categories: low, med_low, med_high, high. The old crime rate variable is now dropped from the dataset.
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2227723 0.2722772 0.2425743 0.2623762
##
## Group means:
## zn indus chas nox rm
## low 1.04704668 -0.8787045 -0.09734683 -0.8884709 0.40473923
## med_low -0.09199095 -0.2796114 -0.05757815 -0.5631767 -0.13784544
## med_high -0.37173443 0.1800432 0.20976484 0.3922049 0.06498778
## high -0.48724019 1.0170298 -0.01233188 1.0533492 -0.38428085
## age dis rad tax ptratio
## low -0.8394818 0.9356588 -0.6934784 -0.7320738 -0.51783712
## med_low -0.3680236 0.3163312 -0.5527622 -0.4505124 -0.07520112
## med_high 0.3889162 -0.3535144 -0.4041220 -0.2892987 -0.30185184
## high 0.8083240 -0.8565108 1.6390172 1.5146914 0.78181164
## black lstat medv
## low 0.37093569 -0.73922357 0.480280927
## med_low 0.31752424 -0.17049535 -0.006532388
## med_high 0.07478961 0.03902381 0.174838730
## high -0.68363983 0.85375936 -0.669896541
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.139409527 0.8800816797 -0.97683565
## indus 0.025167820 -0.0816779181 0.03896636
## chas -0.003321766 -0.0408519038 0.05381218
## nox 0.460938597 -0.7529709833 -1.35302376
## rm 0.028766722 -0.0518005972 -0.17569388
## age 0.236578080 -0.0814869826 -0.35421863
## dis -0.043827086 -0.1916609414 -0.14683940
## rad 3.276054739 1.0202911026 -0.23574745
## tax -0.119406820 -0.2000674347 0.89051409
## ptratio 0.161191541 -0.0002976277 -0.26810131
## black -0.103927911 0.0344339502 0.11931968
## lstat 0.200052465 -0.3044146566 0.21412586
## medv 0.102332355 -0.4563299175 -0.30048359
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9506 0.0355 0.0139
Description 5.1 First we’ll see prior probabilities of groups of our new crime variable. For example prior probability of category high is 0.2425743.
Next we’ll see the means of our scaled variables in the four crime categories. For example mean of the zn in the category low is 0.96775094.
Then we’ll see coefficients of linear discriminants output which provides the linear combination of predictor variables that are used to form the LDA decision rule. For example LD1 for medv is 0.18357863.
In the last section we’ll see proportion of trace which is the proportion of crime categories that is explained by successive discriminant functions. As we can see the LD1 has highest proportion of trace.
5.2 From the graph we can see that correlation between rad zn and zn nox is pretty low. As we know already from the task 3 those correlations are : rad zn -0.31 zn nox -0.52 Also we can see that variable rad is the feature contributing to LD1 dimension.
## predicted
## correct low med_low med_high high
## low 18 17 2 0
## med_low 1 13 2 0
## med_high 0 8 19 1
## high 0 0 0 21
Describtion Here we can see that the prediction of the classes is correct in: low 61% of cases med_low41% of cases med_high78% of cases high100% of cases
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Description Distance between the observations calculated with two different methods: Euclidian distances (Ordinary" straight-line distance between two points in Euclidean space) and Manhattan distances(The distance between two points measured along axes at right angles). This case Euclidian distance mean is 4.9111 and median 4.8241 and Manhattan distance mean is 13.5488 and median 12.6090.
Description For investigating what is the optimal number of clusters I will use the elbow method:
I will say that optimal number of clusters in this case is 2.
Description The graph shows us that which combinations of two variables distinguish the two clusters. For example
tax agedistinguish the two clusters quite well also taxwith many other variables distinguish the two clusters quite well.
Code for last weeks datawrangling is here: https://github.com/elmeris/IODS-project/blob/master/data/human.R
The data is combination of Human Development Index (HDI) data and Gender Inequality Index (GII) data. Data is collected by United Nations Development Programme.
The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and have a decent standard of living. More info about HDI here: http://hdr.undp.org/en/content/human-development-index-hdi
The GII is an inequality index. It measures gender inequalities in three important aspects of human development—reproductive health, measured by maternal mortality ratio and adolescent birth rates; empowerment, measured by proportion of parliamentary seats occupied by females and proportion of adult females and males aged 25 years and older with at least some secondary education; and economic status, expressed as labour market participation and measured by labour force participation rate of female and male populations aged 15 years and older. More info about GII here: http://hdr.undp.org/en/content/gender-inequality-index-gii
Data has now 195 observations with 19 variables. There are also two new variables “Edu2.FM” and “Labo.FM”.
I have renamed the variables with (shorter) descriptive names:
| VARIABLE | NAME |
|---|---|
"HDI.Rank" |
— |
"Country" |
— |
"Human.Development.Index..HDI." |
"HDI" |
"Life.Expectancy.at.Birth" |
"Life.Exp" |
"Expected.Years.of.Education" |
"Edu.Exp" |
"Mean.Years.of.Education" |
"Edu.Mean" |
"Gross.National.Income..GNI..per.Capita" |
"GNI" |
"GNI.per.Capita.Rank.Minus.HDI.Rank" |
"GNI.Minus.Rank" |
"GII.Rank" |
— |
"Gender.Inequality.Index..GII." |
"GII" |
"Maternal.Mortality.Ratio" |
"Mat.Mor" |
"Adolescent.Birth.Rate" |
"Ado.Birth" |
"Percent.Representation.in.Parliament" |
"Parli.F" |
"Population.with.Secondary.Education..Female." |
""Edu2.F"" |
"Population.with.Secondary.Education..Male." |
"Edu2.M" |
"Labour.Force.Participation.Rate..Female." |
"Labo.F" |
"Labour.Force.Participation.Rate..Male." |
"Labo.M" |
"The ratio of Female and Male populations with secondary education in each country" |
"Edu2.FM" |
"the ratio of labour force participation of females and males in each country " |
"Labo.FM" |
Code for this week datawrangling is here: https://github.com/elmeris/IODS-project/blob/master/data/human2.R
Please review my datawrangling exercise for this week from link above. The data have now 155 observations and 8 variables as it should.
I think my datawrangling was successful. However, for the sake of certainty, I will upload wrangled data for this exercise. I will name this data human3.
human3 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep =",", header = T)
str(human3)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
Looks as it should.
library(MASS)
library(tidyr)
library(stringr)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
ggpairs(human3)
cor(human3) %>% corrplot
summary(human3)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Describtion Graphs and correlation matrix visualization shows us correlations between the variables. For example: The strongest correlations are between: - expected years of education and life expectancy - maternal mortality ratio and adolescent birth rate The strongest negative correlations are between: - maternal mortality ratio and life expectancy - life expectancy and adolescent birth rate Summaries are shown in the table above.
pca_human1 <- prcomp(human3)
biplot(pca_human1, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
s<-summary(pca_human1)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca_pr_nonst <- round(100*s$importance[2,], digits = 1)
pc_lab_nonst <- paste0(names(pca_pr_nonst), " (", pca_pr_nonst, "%)")
biplot(pca_human1, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_nonst[1], ylab = pc_lab_nonst[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
human_std <- scale(human3)
pca_human2 <- prcomp(human_std)
s<-summary(pca_human2)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca_pr <- round(100*s$importance[2,], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
pca_human2
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
Describtion(task2/3/4)
In principal component analysis (PCA) on the not standardized human data it seems that GNI has the biggest variance. PC1 captures all the variance.
After standardisation PCA looks much better. Now it seems that GNI is not the only variable that stands out.
The results are different because PCA assumes that the variables with higher variance are more important for those with lower variance.
Now we can interpret results of PCA.
There is a small angle between Maternal mortality ratio and Adolescent Birth Rate point meaning they are correlated positively. Thease variables are strongly contribute to PC1. i.e. in the countries where maternal mortality ratio is high also adolescent birth rate is high.
There is a small angle between Female Labour Force Participation Rate and Percent Female Representation in Parliament meaning they are correlated positively with each other but negatively correlated with Female Population with Secondary Education, Gross National Income per Capita, Expected Years of Education and Life Expectancy at Birth. Thease variables are strongly contribute to PC2. i.e. in the countries where females are in the parliament they are active in labour force also.
There is a small angle between Female Population with Secondary Education, Gross National Income per Capita, Expected Years of Education and Life Expectancy at Birth meaning they are correlated positively with each other but and negatively correlated with maternal mortality ratio and adolescent birth rate. Thease variables are contributing negatively to PC1. i.e. in the countries where women have secondary education also GNI is higher, and expect years of education is higher and life expectancy is higher.
PC1: Most meaning full components are life expectancy and maternal mortality. These arrows goes other direction. i.e. in the countries where maternal mortality ratio is high the life life expectancy is low.
PC2: Most meaning full components are Female Labour Force Participation Rate and Percent Female Representation in Parliament. These arrows goes same direction. i.e. in the countries where females are in the parliament they are active in labour force also.
library(FactoMineR)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
Describtion Data has 300 onservations and 36 variables. The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions). Rows represent the individuals, columns represent the different questions. The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables.
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300 6
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Describtion Now the data has 300 onservations and 6 variables. Graphs shows - most individuals drinks tea in bag form - most individuals tee alone - most individuals do not drink in lunch time - there is not so big difference between individuals who uses sugar and who doesnt - most individuals drinks Earl Grey - most individuals drink in the chain store
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")
Describtion First part (Eigenvalues) there is the proportion of the data variance explained by each 11 dimension. Dim 1: explains 15% of the varaince, Dim 2: 14%, Dim 3: 12%, and so on…
Second part (Categories (the 10 first)) there is the information about the contribution of the 10 first individuals to the first three dimensions.
Third part (the categorical variables) there is the squared correlation between each variable and the dimensions.
Biplot: - Variables like: other, chain store+tea shop, tea bag+unpackaged, lunch, lemon, green, unpackaged, tea shop are contributed to the Dim 2. - Variables like: black, tea bag, chain store are contributing negatively to Dim 1.